Knowledge Discovery in Molecular Structure Databases
Abstract
In recent years there has been an explosion of information generated on molecular structures. This information is usually three-dimensional in nature, and is normally appended to the existing molecular database in the form of raw coordinate data. There is a recognized need for new tools to structure, manage, and analyze these data. This thesis presents a knowledge discovery technique, based on a theory of conceptual clustering, in which recurrent associations between molecular constitution and conformation can be discovered in three-dimensional molecular structure databases. These associations are defined in first-order logic as structured concepts, and are maintained in a taxonomy partially ordered by subsumption. The knowledge discovery method is applied to two large molecular structure databases. In the Cambridge Structural Database the technique is used to perform conformational clustering of several small molecule datasets. The resultant conformational templates are evaluated by comparison with hand analyses and statistical clusterings of the same data. In the Protein Data Bank the technique is used to discover recurrent protein motifs which have sequences predictive of their associated structure. The resultant rules are evaluated with respect to the average structural variation of their instances, and according to a test of statistical correlation between sequence and structure. In both molecular structure databases interesting and significant associations were discovered.

Acknowledgements

My initial inspiration and excitement for the topic of knowledge discovery in molecular databases developed during intensive research meetings with Frank Allen, Suzanne Fortier, and Janice Glasgow at Queen's University in late 1990. Since then, and during my research, I have become increasingly fascinated by the topic. For their role in maintaining my interest, special thanks are due to my supervisors, Janice Glasgow and Suzanne Fortier.
They have provided constant encouragement, enthusiasm, interest, and support for this work.

Frank Allen at the Cambridge Crystallographic Data Centre has been a tremendous source of information and inspiration for this work. Much of Chapter 5 would not have been possible without his help in providing me with structural data and listings from statistical clustering experiments. I would also like to thank the Cambridge Crystallographic Data Centre for providing me with the opportunity to use their facilities during several research visits from 1992 through 1995.

I would like to thank Gilles Bisson, Diane Cook, Bob Levinson and Kevin Thompson for their generous time and valuable comments on a draft of Chapter 2. Jude Shavlik, as editor of the Machine Learning journal, helped out a lot with the overall presentation and clarity of Chapter 6. Gregory Piatetsky-Shapiro offered useful advice on the statistical evaluation of predictive rules.

Several groups of people have listened to presentations of this thesis material. For their generous time and useful comments, I would like to thank the Department of Information Studies at Sheffield University, the Ottawa Machine Learning Group, the Biomolecular Informatics group at ZymoGenetics Inc. and the Molecular Scene Analysis Group at Queen's University.

The prototype software developed to support this thesis was done in the Q'Nial array programming language. I would like to thank the creator of the language, Mike Jenkins, for taking an interest in my work and for ensuring that I always had the best programming facilities to work with. Chris Walmsley has given me all kinds of technical advice and support; in particular, I would like to acknowledge his MolView software which was used to create some of the figures in Chapters 5 and 6. Figure 5.10 was prepared using the daVinci graph layout system. The diagrams in Figure II.1 were prepared using the QUEST and PLUTO programs from the Cambridge Structural Database.
This thesis is dedicated to my grandmother, Phyllis Mobray Hughes (1906-1990).
Similar resources
Data Mining and Knowledge Discovery in Molecular Databases - Session Introduction
The development and growth of molecular databases over the last decade has brought a growing problem to the biocomputing community. Our ability to analyze, summarize and extract information from these databases has lagged far behind our ability to collect and store data. As well, traditional methods for handling data, either automated or manual, cannot be effectively applied because of the volume...
A Review of Data Mining Applications in the Health System
Introduction: The extensive amounts of data stored in medical databases require the development of specialized tools for accessing the data, data analysis, knowledge discovery, and the effective use of the data. Data mining is one of the most important methods. The article outlines the data mining techniques used and illustrates their applicability to medical diagnostic and prognostic problems. ...
VAMMPIRE: a matched molecular pairs database for structure-based drug design and optimization.
Structure-based optimization to improve the affinity of a lead compound is an established approach in drug discovery. Knowledge-based databases holding molecular replacements can be supportive in the optimization process. We introduce a strategy to relate the substitution effect within matched molecular pairs (MMPs) to the atom environment within the cocrystallized protein-ligand complex. Virtu...
A Framework for Knowledge Discovery and Evolution in Databases
Although knowledge discovery is increasingly important in databases, discovered knowledge is not always useful to users. It is mainly because the discovered knowledge does not fit user's interests, or it may be redundant or inconsistent with a priori knowledge. Knowledge discovery in databases depends critically on how well a database is characterized and how consistently the existing and discov...
The evolving role of information technology in the drug discovery process.
Information technologies for chemical structure prediction, heterogeneous database access, pattern discovery, and systems and molecular modeling have evolved to become core components of the modern drug discovery process. As this evolution continues, the balance between in silico modeling and 'wet' chemistry will continue to shift and it might eventually be possible to step through the discover...
Analyzing Inconsistency Toward Enhancing Integration of Biological Molecular Databases
The rapid growth of biological databases not only provides biologists with abundant data but also presents a big challenge in relation to the analysis of data. Many data analysis approaches such as data mining, information retrieval and machine learning have been used to extract frequent patterns from diverse biological databases. However, the discrepancies, due to the differences in the struct...
Publication date: 2008